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University of Nottingham 

The statistical analysis of covariance matrix data is considered 
and, in particular, methodology is discussed which takes into account 
the non-Euclidean nature of the space of positive semi-definite sym- 
metric matrices. The main motivation for the work is the analysis of 
diffusion tensors in medical image analysis. The primary focus is on 
estimation of a mean covariance matrix and, in particular, on the use 
of Procrustes size-and-shape space. Comparisons are made with other 
estimation techniques, including using the matrix logarithm, matrix 
square root and Cholesky decomposition. Applications to diffusion 
tensor imaging are considered and, in particular, a new measure of 
fractional anisotropy called Procrustes Anisotropy is discussed. 



1. Introduction. The statistical analysis of covariance matrices occurs 
in many important applications, for example, in diffusion tensor imaging 
[Alexander (2005); Schwartzman, Dougherty and Taylor (2008)] or longitu- 
dinal data analysis [Daniels and Pourahmadi (2002)]. We wish to consider 
the situation where the data at hand are sample covariance matrices, and 
we wish to estimate the population covariance matrix and carry out statisti- 
cal inference. An example application is in diffusion tensor imaging where a 
diffusion tensor is a covariance matrix related to the molecular displacement 
at a particular voxel in the brain, as described in Section 2. 

If a sample of covariance matrices is available, we wish to estimate an 
average covariance matrix, or we may wish to interpolate in space between 
two or more estimated covariance matrices, or we may wish to carry out 
tests for equality of mean covariance matrices in different groups. 
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The usual approach to estimating mean covariance matrices in Statistics is 
to assume a scaled Wishart distribution for the data, and then the maximum 
likelihood estimator (m.l.e.) of the population covariance matrix is the arith- 
metic mean of the sample covariance matrices. The estimator can be formu- 
lated as a least squares estimator using Euclidean distance. However, since 
the space of positive semi-definite symmetric matrices is a non-Euclidean 
space, it is more natural to use alternative distances. In Section 3 we define 
what is meant by a mean covariance matrix in a non-Euclidean space, us- 
ing the Frechet mean. We then review some recently proposed techniques 
based on matrix logarithms and also consider estimators based on matrix 
decompositions, such as the Cholesky decomposition and the matrix square 
root. 

In Section 4 we introduce an alternative approach to the statistical anal- 
ysis of covariance matrices using the Kendall's (1989) size-and-shape space. 
Distances, minimal geodesies, sample Frechet means, tangent spaces and 
practical estimators based on Procrustes analysis are all discussed. We in- 
vestigate properties of the estimators, including consistency. 

In Section 5 we compare the various choices of metrics and their prop- 
erties. We investigate measures of anisotropy and discuss the deficient rank 
case in particular. We consider the motivating applications in Section 6 
where the analysis of diffusion tensor images and a simulation study are 
investigated. Finally, we conclude with a brief discussion. 

2. Diffusion tensor imaging. In medical image analysis a particular type 
of covariance matrix arises in diffusion weighted imaging called a diffusion 
tensor. The diffusion tensor is a 3 x 3 covariance matrix which is estimated 
at each voxel in the brain, and is obtained by fitting a physically-motivated 
model on measurements from the Fourier transform of the molecule displace- 
ment density [Basser, Mattiello and Le Bihan (1994); Alexander (2005)]. 

In the diffusion tensor model the water molecules at a voxel diffuse accord- 
ing to a multivariate normal model centered on the voxel and with covari- 
ance matrix E. The displacement of a water molecule x G R 3 has probability 
density function 



The convention is to call D = E/2 the diffusion tensor, which is a symmetric 
positive semi-definite matrix. The diffusion tensor is estimated at each voxel 
in the image from the available MR images. The MR scanner has a set 
of magnetic field gradients applied at directions g\ , g2 , • • • , g m £ RP 2 with 
scanner gradient parameter b, where RP 2 is the real projective space of 
axial directions (with gj = —gj, \\gj\\ = 1). The data at a voxel consist of 
signals (Zq, Z±, . . . , Z m ) which are related to the Fourier transform of the 
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Fig. 1. Visualization of a diffusion tensor as an ellipsoid. The principal axis is also 
displayed. 

displacements in axial direction gj G RP 2 ,j = 1, . . .,m, and the reading Zq 
is obtained with no gradient (6 = 0). The Fourier transform in axial direction 
g £ RP 2 of the multivariate Gaussian displacement density is given by 

F(9) = J exp(iVbg)f(x) dx = ex.p(-bg T Dg), 

and the theoretical model for the signals is 

Zj = Z T{gj) = Z exp(-bgjDgj), j = l,...,m. 

There are a variety of methods available for estimating D from the data 
(Zq, Zi,..., Z m ) at each voxel [see Alexander (2005)], including least squares 
regression and Bayesian estimation [e.g., Zhou et al. (2008)]. Noise models 
include log-Gaussian, Gaussian and, more recently, Rician noise [Wang et al. 
(2004); Fillard et al. (2007); Basu, Fletcher and Whitaker (2006)]. A common 
method for visualizing a diffusion tensor is an ellipsoid with principal axes 
given by the eigenvectors of D, and lengths of axes proportional to = 
1,2,3. An example is given in Figure 1. 

If a sample of diffusion tensors is available, we may wish to estimate an 
average diffusion tensor matrix, investigate the structure of variability in 
diffusion tensors or interpolate at higher spatial resolution between two or 
more estimated diffusion tensor matrices. 

In diffusion tensor imaging a strongly anisotropic diffusion tensor indicates 
a strong direction of white matter fiber tracts, and plots of measures of 
anisotropy are very useful to neurologists. A measure that is very commonly 
used in diffusion tensor imaging is Fractional Anistropy, 

f k k I k 1 V2 

where < FA < 1 and Aj are the eigenvalues of the diffusion tensor matrix. 
Note that FA ~ 1 if Ai ^> Aj ~ 0,2 > 1 (very strong principal axis) and 
FA = for isotropy. In diffusion tensor imaging k = 3. 



4 



I. L. DRYDEN, A. KOLOYDENKO AND D. ZHOU 




Fig. 2. An FA map from a slice in a human brain. Lighter values indicate higher FA. 

In Figure 2 we see a plot of FA from an example healthy human brain. We 
focus on the small inset region in the box, and we would like to interpolate 
the displayed image to a finer scale. We return to this application in Section 
6.3. 

3. Covariance matrix estimation. 

3.1. Euclidean distance. Let us consider n sample covariance matrices 
(symmetric and positive semi-definite k x k matrices) S\,...,S n which are 
our data (or sufficient statistics). We assume that the Si are independent 
and identically distributed (i.i.d.) from a distribution with mean covariance 
matrix E, although we shall elaborate more later in Section 3.2 about what 
is meant by a "mean covariance matrix." The main aim is to estimate E. 
More complicated modeling scenarios are also of interest, but for now we 
just concentrate on estimating the mean covariance matrix E. 

The most common approach is to assume i.i.d. scaled Wishart distribu- 
tions for Si with E(Si) = E, and the m.l.e. for E is Eg = -Ya^i^i- This 
estimator can also be obtained if using a least squares approach by minimiz- 
ing the sum of square Euclidean distances. The Euclidean distance between 
two matrices is given by 

(1) d E (S l ,S 2 ) = ||5i - S 2 \\ = ^trace{(Si - S 2 ) rr (S l - 5 2 )}, 

where \\X\\ = J trace (X T X) is the Euclidean norm (also known as the Frobe- 
nius norm). The least squares estimator is given by 

n 

E^ = arginf ^ - E|| 2 . 
s i=i 

However, the space of positive semi-definite symmetric matrices is a non- 
Euclidean space and other choices of distance are more natural. One par- 
ticular drawback with Euclidean distance is when extrapolating beyond the 
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data, nonpositive semi-definite estimates can be obtained. There are other 
drawbacks when interpolating covariance matrices, as we shall see in our 
applications in Section 6. 

3.2. The Frechet mean. When using a non-Euclidean distance d(-) we 
must define what is meant by a "mean covariance matrix." Consider a prob- 
ability distribution for a k x k covariance matrix S on a Riemannian metric 
space with density f(S). The Frechet (1948) mean E is defined as 



£ = arginfi|d(S,£) 2 /(S)dS, 



and is also known as the Karcher mean [Karcher (1977)]. The Frechet mean 
need not be unique in general, although for many distributions it will be. 
Provided the distribution is supported only on the geodesic ball of radius r, 
such that the geodesic ball of radius 2r is regular [i.e., supremum of sectional 
curvatures is less than (7r/(2r)) 2 ], then the Frechet mean E is unique [Le 
(1995)]. The support to ensure uniqueness can be very large. For example, 
for Euclidean spaces (with sectional curvature zero), or for non-Euclidean 
spaces with negative sectional curvature, the Frechet mean is always unique. 

If we have a sample Si,...,S n of i.i.d. observations available, then the 
sample Frechet mean is calculated by finding 

n 

£ = arginf d(Si, £) 2 . 
E i=i 

Uniqueness of the sample Frechet mean can also be determined from the 
result of Le (1995). 

3.3. Non-Euclidean covariance estimators. A recently derived approach 
to covariance matrix estimation is to use matrix logarithms. We write the 
logarithm of a positive definite covariance matrix S as follows. Let S = 
UAU T be the usual spectral decomposition, with U E 0(k) an orthogonal 
matrix and A diagonal with strictly positive entries. Let log A be a diagonal 
matrix with logarithm of the diagonal elements of A on the diagonal. The 
logarithm of S is given by logS = J7(log A)C7 T and likewise the exponential 
of the matrix S is exp S = f/(expA)[/ T . Arsigny et al. (2007) propose the 
use of the log-Euclidean distance, where Euclidean distance between the 
logarithm of covariance matrices is used for statistical analysis, that is, 

(2) di(5i,52) = ||log(5 1 )-log(5 2 )||. 

An estimator for the mean population covariance matrix using this ap- 
proach is given by 

t L = exp | arginf ^ ||logSj - log£|| 2 j = expj ^ ^ logS* j. 
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Using this metric avoids extrapolation problems into matrices with nega- 
tive eigenvalues, but it cannot deal with positive semi-definite matrices of 
deficient rank. 

A further logarithm-based estimator uses a Riemannian metric in the 
space of square symmetric positive definite matrices 

(3) d R (Sx,S 2 ) = ||log(S- 1/2 S 2 S- 1/2 )||. 
The estimator (sample Frechet mean) is given by 

n 

S R = arginf^||log(5.r 1 / 2 S5r 1 / 2 )|| 2 , 

which has been explored by Pennec, Fillard and Ayache (2006), Moakher 

(2005) , Schwartzman (2006), Lenglet, Rousson and Deriche (2006) and 
Fletcher and Joshi (2007). The estimate can be obtained using a gradi- 
ent descent algorithm [e.g., see Pennec (1999); Pennec, Fillard and Ayache 

(2006) ]. Note that this Riemannian metric space has negative sectional cur- 
vature and so the population and sample Frechet means are unique in this 
case. 

Alternatively, one can use a reparameterization of the covariance matrix, 
such as the Cholesky decomposition [Wang et al. (2004)], where Si = LiLj 
and Li = chol(5j) is lower triangular with positive diagonal entries. The 
Cholesky distance is given by 

(4) d c (S 1 ,S 2 ) = ||chol(5i) - chol(5 2 )||. 
A least squares estimator can be obtained from 

f 1 n 1 1 n 

£ C = A C A£, where A c = arginN - V||^- A|| 2 > = ~Y l L i . 

\ i=i ) i=i 

An equivalent model-based approach would use an independent Gaussian 
perturbation model for the lower triangular part of Lj, with mean given by 
the lower triangular part of Ac*, and so Ac is the m.l.e. of Ac under this 
model. Hence, in this approach the averaging is carried out on a square root 
type-scale, which would indeed be the case for k = 1 dimensional case where 
the estimate of variance would be the square of the mean of the sample 
standard deviations. 

An alternative decomposition is the matrix square root where S 1 / 2 = 
UA 1 / 2 U 1 - , which has not been used in this context before as far as we are 
aware. The distance is given by 

(5) d H (S 1 ,S 2 ) = \\sl /2 -S 1 2 /2 \\. 
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A least squares estimator can be obtained from 

{n ~\ -. n 

£||S, 1/2 -A|| 2 =^E^ /2 - 
i=l ) i=l 

However, because LiRR T Lj = L{Lj for R E O(k), another new alterna- 
tive is to relax the lower triangular or square root parameterizations and 
match the initial decompositions closer in terms of Euclidean distance by 
optimizing over rotations and reflections. This idea provides the rationale 
for the main approaches in this paper. 

4. Procrustes size-and-shape analysis. 

4.1. Non-Euclidean size-and-shape metric. The non-Euclidean size-and- 
shape metric between two k x k covariance matrices Si and 52 is defined 
as 

(6) d s (Si,S 2 )= inf \\L t -L 2 R\\, 

where Li is a decomposition of Si such that Si = L{Lj, i = 1, 2. For example, 
we could have the Cholesky decomposition Li = chol(5j),i = 1,2, which is 
lower triangular with positive diagonal elements, or we could consider the 
matrix square root L = S 1 / 2 = UA 1 / 2 U T , where S = UAU T is the spectral 
decomposition. Note that Si = (LiR)(LiR) T for any R E 0(k), and so the 
distance involves matching Li optimally, in a least-squares sense, to L 2 by 
rotation and reflection. Since S = LL T , then the decomposition is repre- 
sented by an equivalence class {LR:R E 0(k)}. For practical computation 
we often need to choose a representative from this class, called an icon, and 
in our computations we shall choose the Cholesky decomposition. 
The Procrustes solution for matching L 2 to Li is 

R = arg inf \\Li — L 2 R\\ 

(7) 

= UW T , where Lj L 2 = WAU T ,U,W G O(k), 

and A is a diagonal matrix of positive singular values [e.g., see Mardia, Kent 
and Bibby (1979), page 416]. 

This metric has been used previously in the analysis of point set configura- 
tions where invariance under translation, rotation and reflection is required. 
Size-and-shape spaces were introduced by Le (1988) and Kendall (1989) as 
part of the pioneering work on the shape analysis of landmark data [cf. 
Kendall (1984)]. The detailed geometry of these spaces is given by Kendall 
et al. [(1999), pages 254-264], and, in particular, the size-and-shape space is 
a cone with a warped-product metric and has positive sectional curvature. 
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Equation (6) is a Riemannian metric in the reflection size-and-shape space 
of (k + l)-points in k dimensions [Dryden and Mardia (1998), Chapter 8]. 
In particular, dg(-) is the reflection size-and-shape distance between the 
(k + 1) x k configurations H T L\ and H T L 2 , where H is the k x (k, + 1) 
Helmert sub-matrix [Dryden and Mardia (1998), page 34] which has jth 
row given by 

(hj,..., hj , -jhj , 0^0 ) , hj = + 1)}- 1/2 , 

j times k—j times 

for j = 1, . . . , k. 

Hence, the statistical analysis of covariance matrices can be considered 
equivalent to the dual problem of analyzing reflection size-and-shapes. 



4.2. Minimal geodesic and tangent space. Let us consider the minimal 
geodesic path through the reflection size-and-shapes of L\ and L 2 in the 
reflection size-and-shape space, where LiLj = Si, i = 1,2. Following an ar- 
gument similar to that for the minimal geodesies in shape spaces [Kendall et 
al. (1999)], this minimal geodesic can be isometrically expressed as L\ + tT, 
where T are the horizontal tangent co-ordinates of L 2 with pole L\. Kendall 
et al. [(1999), Section 11.2] discuss size-and-shape spaces without reflection 
invariance, however, the results with reflection invariance are similar, as re- 
flection does not change the local geometry. 

The horizontal tangent coordinates satisfy LiT T = TLj [Kendall et al. 
(1999), page 258]. Explicitly, the horizontal tangent coordinates are given 
by 

T = L 2 R-L 1 , R = inf \\L 1 -L 2 R\\, 

ReO(k) 

where R is the Procrustes match of L 2 onto L\ given in (7). So, the geodesic 
path starting at L\ and ending at L 2 is given by 

W1L1 + w 2 L 2 R, 

where w\ + w 2 = 1, Wi > 0, i = 1, 2, and R is given in (7). Minimal geodesies 
are useful in applications for interpolating between two covariance matrices, 
in regression modeling of a series of covariance matrices, and for extrapola- 
tion and prediction. 

Tangent spaces are very useful in practical applications, where one uses 
Euclidean distances in the tangent space as approximations to the non- 
Euclidean metrics in the size-and-shape space itself. Such constructions are 
useful for approximate multivariate normal based inference, dimension re- 
duction using principal components analysis and large sample asymptotic 
distributions. 
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4.3. Procrustes mean covariance matrix. Let Si,...,S n be a sample of 
n positive semi-definite covariance matrices each of size k x k from a dis- 
tribution with density f(S), and we work with the Procrustes metric (6) in 
order to estimate the Frechet mean covariance matrix S. We assume that 
f(S) leads to a unique Frechet mean (see Section 3.2). 

The sample Frechet mean is calculated by finding 

n 

S 5 = arginf^d 5 (5i,S) 2 . 
^ i=i 

In the dual size-and-shape formulation we can write this as 

n 

(8) S S = A S A|, where A s = arginfV inf \\H T URi - H T A\\ 2 . 

The solution can be found using the Generalized Procrustes Algorithm 
[Gower (1975); Dryden and Mardia (1998), page 90], which is available in 
the shapes library (written by the first author of this paper) in R [R De- 
velopment Core Team (2007)]. Note that if the data lie within a geodesic 
ball of radius r such that the geodesic ball of radius 2r is regular [Le (1995); 
Kendall (1990)], then the algorithm finds the global unique minimum solu- 
tion to (8). This condition can be checked for any dataset and, in practice, 
the algorithm works very well indeed. 

4.4. Tangent space inference. If the variability in the data is not too 
large, then we can project the data into the tangent space and carry out the 
usual Euclidean based inference in that space. 

Consider a sample S± , . . . , S n of covariance matrices with sample Frechet 
mean £5 and tangent space coordinates with pole Ss = AgAg given by 

Vi = As-URi, 

where Ri is the Procrustes rotation for matching Li to Ag, i = 1, . . . , n, and 
S{ = LiLf , i = 1, 2. 

Frequently one wishes to reduce the dimension of the problem, for exam- 
ple, using principal components analysis. Let 

1 n 

S v = -Vvec(l/i)vec(Vi) T , 

n r— f 

1=1 

where vec is the vectorize operation. The principal component (PC) load- 
ings are given by = 1, . . . ,p, the eigenvectors of S v corresponding to 
the eigenvalues Ai > A2 > • • • > A p > 0, where p is the number of nonzero 
eigenvalues. The PC score for the ith individual on PC j is given by 

Sij = ijvec(Vi), i = 1, . . . ,n,j = 1, . . . ,p. 
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In general, p = min(n — \,k{k + l)/2). The effect of the jth PC can be 
examined by evaluating 

E(c) = (A S + cvec fc - 1 (A J 1/2 7,))(A 5 + cvec^ 1 (Xf^f 

for various c [often in the range c G (—3, 3), for example], where vec^" 1 (vec(y)) 
V for a k x k matrix V . 

Tangent space inference can proceed on the first p PC scores, or possibly 
in lower dimensions if desired. For example, Hotelling's T 2 test can be carried 
out to examine group differences, or regression models could be developed 
for investigating the PC scores as responses versus various covariates. We 
shall consider principal components analysis of covariance matrices in an 
application in Section 6.2. 

4.5. Consistency. Le (1995, 2001) and Bhattacharya and Patrangenaru 
(2003, 2005) provide consistency results for Riemannian manifolds, which 
can be applied directly to our situation. Consider a distribution F on the 
space of covariance matrices which has size-and-shape Frechet mean Eg. Let 
Si,. . . ,S n be i.i.d. from F, such that they lie within a geodesic ball B r such 
that -E>2r is regular. Then 

p 

Y*s — > Eg, as n ^ oo, 

where Eg is unique. In addition, we can derive a central limit theorem result 
as in Bhattacharya and Patrangenaru (2005), where the tangent coordinates 
have an approximate multivariate normal distribution for large n. Hence, 
confidence regions based on the bootstrap can be obtained, as in Amaral, 
Dryden and Wood (2007) and Bhattacharya and Patrangenaru (2003, 2005). 

4.6. Scale invariance. In some applications it may be of interest to con- 
sider invariance over isotropic scaling of the covariance matrix. In this case 
we could consider the representation of the covariance matrix using Kendall's 
reflection shape space, with the shape metric given by the full Procrustes 
distance 

Li 



(9) d F (Si,S 2 )= inf 

R£O(k),/3>0 



@L 2 R 



where Si = LiLj ,i = 1,2, and (3 > is a scale parameter. Another choice 
of the estimated covariance matrix from a sample Si, . . . , S n , which is scale 
invariant and based on the full Procrustes mean shape (extrinsic mean), is 

n 

£ F = A F AZ, where A F = arg inf V( inf WPiURi - All 2 }, 
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Table 1 

Notation and definitions of the distances and estimators 



Name 


Notation 


Form 


Estimator 


Equation 


Euclidean 


(1e{Si,S2) 


IIA-^H 


£ E 


(1) 


Log-Euclidean 


di,(S\, S 2 ) 


||log(5i)-log(S 2 )]| 


El 


(2) 


Riemannian 


dn(Sl,S 2 ) 


iiio g (5r 1/2 s 2 5- l/2 )ii 


£r 


(3) 


Cholesky 


d c (S u S 2 ) 


||chol(Si) -chol(S 2 )|| 


t C 


(1) 


Root Euclidean 


d„{Si,S 2 ) 




±H 


(5) 


Procrustes size-and-shape 


ds(Si,S 2 ) 


inffleo(fe) \\Li - L 2 R\\ 


Es 


(6) 


Full Procrustes shape 


d F (S u S 2 ) 


infii e o(fe),/3>o II p^hf - PL 2 R\\ 


Ef 


(9) 


Power Euclidean 


d A (Si,Sa) 


i\\s?-s%\\ 


t A 


(10) 



and Si = L{Lj ,i = 1, . . . , n, and Pi > are scale parameters. The solution can 
again be found from the Generalized Procrustes Algorithm using the shapes 
library in R. Tangent space inference can then proceed in an analogous 
manner to that of Section 4.4. 

5. Comparison of approaches. 

5.1. Choice of metrics. In applications there are several choices of dis- 
tances between covariance matrices that one could consider. For complete- 
ness we list the metrics and the estimators considered in this paper in Table 
1, and we discuss briefly some of their properties. 

Estimators Eg, £c, E#, Er,, Sa are straightforward to compute using arith- 
metic averages. The Procrustes based estimators E^E^ involve the use of 
the Generalized Procrustes Algorithm, which works very well in practice. 
The Riemannian metric estimator £r uses a gradient descent algorithm 
which is guaranteed to converge [see Pennec (1999); Pennec, Fillard and 
Ayache (2006)]. 

All these distances except dc are invariant under simultaneous rotation 
and reflection of S\ and S2, that is, the distances are unchanged by replacing 
both Si by VSiV T ,Ve 0(k),i= 1,2. Metrics d L {-),d R (-),d F (-) are invari- 
ant under simultaneous scaling of Si,i = 1,2, that is, replacing both Si by 
/3Si, (5 > 0. Metric dji(-) is also affine invariant, that is, the distances are 
unchanged by replacing both Si by ASiA T ,i = 1,2, where A is a general 
k x k full rank matrix. Metrics g?l( - )> ^.r(') have the property that 

d(A,I k ) = d(A-\l k ), 

where 1^ is the k x k identity matrix. 

Metrics di,{ m ),dii(-),dF(-) are not valid for comparing rank deficient co- 
variance matrices. Finally, there are problems with extrapolation with met- 
ric g?s(-): extrapolate too far and the matrices are no longer positive semi- 
definite. 
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5.2. Anisotropy. In some applications a measure of anisotropy of the 
covariance matrix may be required, and in Section 2 we described the com- 
monly used FA measure. An alternative is to use the full Procrustes shape 
distance to isotropy and we have 



PA 



k 



k-1 



d F (h,S) 



k 



inf 



k-1 R£0{k),f3eR^ 

, k iV2 



-j=-Pch.ol(S)R 

v K 



tE^-Va) 2 /^ 

i=i ' i=i 



where \f\ = hYl VK- Note that the maximal value of dp distance from 
isotropy to the rank 1 covariance matrix is y/(k — 1) /k, which follows from Le 
(1992). We include the scale factor when defining the Procrustes Anisotropy 
(PA) , and so < PA < 1 , with PA = indicating isotropy, and PA ~ 1 indi- 
cating a very strong principal axis. 

A final measure based on metrics di or dn is the geodesic anisotropy 



GA= ^log^-logA) 5 



1/2 



where < GA < oo [Arsigny et al. (2007); Fillard et al. (2007); Fletcher and 
Joshi (2007)], which has been used in diffusion tensor analysis in medical 
imaging with k = 3. 



5.3. Deficient rank case. In some applications covariance matrices are 
close to being deficient in rank. For example, when FA or PA are equal 
to 1, then the covariance matrix is of rank 1. The Procrustes metrics can 
easily deal with deficient rank matrices, which is a strong advantage of the 
approach. Indeed, Kendall's (1984, 1989) original motivation for developing 
his theory of shape was to investigate rank 1 configurations in the context 
of detecting "flat" (collinear) triangles in archeology. 

The use of and has strong connections with the use of Bookstein's 
(1986) hyperbolic shape space and Le and Small's (1999) simplex shape 
space, and such spaces cannot deal with deficient rank configurations. 

The use of the Cholesky decomposition has strong connections with Book- 
stein coordinates and Goodall-Mardia coordinates in shape analysis, where 
one registers configurations on a common baseline [Bookstein (1986); Goodall 
and Mardia (1992)]. For small variability the baseline registration method 
and Procrustes superimposition techniques are similar, and there is an ap- 
proximate linear relationship between the two [Kent (1994)]. In shape anal- 
ysis edge superimposition techniques can be very unreliable if the baseline 
is very small in length, which would correspond to very small variability in 



NON-EUCLIDEAN STATISTICS FOR COVARIANCE MATRICES 



13 



^ o o 

^ ^ o o Q I 

^ & o ^ ^ ^ 

Fig. 3. Four different geodesic paths between the two tensors. The geodesic paths are 
obtained using cIe(-) (1st row), di(-) (2nd row), dc(-) (3rd row) and ds(-) (J^th row). 

particular diagonal elements of the covariance matrix in the current context. 
Cholesky methods would be unreliable in such cases. Also, Bookstein coor- 
dinates induce correlations in the shape variables and, hence, estimation of 
covariance structure is biased [Kent (1994)]. Hence, in general, Procrustes 
techniques are preferred over edge superimposition techniques in shape anal- 
ysis. Hence, this would mean in the current context that the Procrustes ap- 
proaches of this paper should be preferred to inference using the Cholesky 
decomposition. 

6. Applications. 

6.1. Interpolation of covariance matrices. Frequently in diffusion tensor 
imaging one wishes to carry out interpolation between tensors. When the 
tensors are quite different, interpolation using different metrics can lead to 
very different results. For example, consider Figure 3, where four different 
geodesic paths are plotted between two tensors. Arsigny et al. (2007) note 
that the Euclidean metric is prone to swelling, which is seen in this example. 
Also, the log-Euclidean metric gives strong weight to small volumes. In this 
example the Cholesky and Procrustes size-and-shape paths look rather dif- 
ferent, due to the extra rotation in the Procrustes method. From a variety of 
examples it does seem clear that the Euclidean metric is very problematic, 
especially due to the swelling of the volume. In general, the log-Euclidean 
and Procrustes size-and-shape methods seem preferable. 

In some applications, for example, fiber tracking, we may need to inter- 
polate between several covariance matrices on a grid, in which case we can 
use weighted Frechet means 

n n 

t, = argmf^2wid(Si,T,) 2 , ^^ = 1, 
s i=i i=i 
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Fig. 4. Demonstration of PC A for covariance matrices. The true geodesic path is given 
in the penultimate row (black). We then add noise in the three initial rows (red). Then we 
estimate the mean and find the first principal component (yellow), displayed in the bottom 
row. 



where the weights Wi are proportional to a function of the distance (e.g., 
inverse distance or Kriging based weights). 

6.2. Principal components analysis of diffusion tensors. We consider now 
an example estimating the principal geodesies of the covariance matrices 
Si,...,S n using the Procrustes size-and-shape metric. The data are dis- 
played in Figure 4 and here k = 3. We consider a true geodesic path (black) 
and evaluate 11 equally spaced covariance matrices along this path. We 
then add noise for three separate realizations of noisy paths (in red). The 
noise is independent and identically distributed Gaussian and is added in 
the dual space of the tangent coordinates. First, the overall Procrustes size- 
and-shape mean Eg is computed based on all the data (n = 33), and then 
the Procrustes size-and-shape tangent space co-ordinates are obtained. The 
first principal component loadings are computed and projected back to give 
an estimated minimal geodesic in the covariance matrix space. We plot this 
path in yellow by displaying 11 covariance matrices along the path. As we 
would expect, the first principal component path bears a strong similarity 
to the true geodesic path. The percentages of variability explained by the 
first three PCs are as follows: PCI (72.0%), PC2 (8.8%), PC3 (6.5%). 

The data can also be seen in the dual Procrustes space of 4 points in k = 3 
dimensions in Figure 5. We also see the data after applying the Procrustes 
fitting, we show the effects of the first three principal components, and also 
the plot of the first three PC scores. 
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Fig. 5. (top left) The noisy configurations in the dual space of k + 1 = 4 points in k = 3 
dimensions. For each configuration point 1 is colored black, point 2 is red, point 3 is green 
and point 4 is blue, and the points in a configuration are joined by lines, (top right) The 
Procrustes registered data, after removing translation, rotation and reflection, (bottom left) 
The Procrustes mean size- and- shape, with vectors drawn along the directions of the first 
three PCs (PCI —black, PC2—red, PCS — green), (bottom right) The first three PC scores. 
The points are colored by the position along the true geodesic from left to right (black, red, 
green, blue, cyan, purple, yellow, grey, black, red, green). 



6.3. Interpolation. We consider the interpolation of part of the brain 
image in Figure 2. In Figure 6(a) we see the original FA image, and in Figure 
6(b) and (c) we see interpolated images using size-and-shape distance. The 
interpolation is carried out at two equally spaced points between voxels, 
and Figure 6(b) shows the FA image from the interpolation and Figure 6(c) 
shows the PA image. In the bottom right plot of Figure 6 we highlight 
the selected regions in the box. It is clear that the interpolated images are 
smoother, and it is clear from the anisotropy maps of the interpolated data 
that the cingulum (eg) is distinct from the corpus callosum (cc). 

6.4. Anisotropy. As a hnal application we consider some diffusion ten- 
sors obtained from diffusion weighted images in the brain. In Figure 7 we 
see a coronal slice from the brain with the 3x3 tensors displayed. This 
image is a coronal view of the brain, and the corpus callosum and cingulum 
can be seen. The diagonal tract on the lower left is the anterior limb of the 
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Fig. 6. FA maps from the original (a) and interpolated (b) data. In (c) the PA map 
is displayed, and in (al), (bl), (cl) we see the zoomed in regions marked in (a), (b), (c) 
respectively. 

internal capsule and on the lower right we see the superior fronto-occipital 
fasciculus. 
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Fig. 7. /n i/ie upper plots we see the anisotropy measures (left) FA, (middle) PA, (right) 
GA. In the lower plot we see the diffusion tensors, which have been scaled to have volume 
proportional to \J FA. 
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At first sight all three measures appear broadly similar. However, the 
PA image offers more contrast than the FA image in the highly anisotropic 
region — the corpus callosum. Also, the GA image has rather fewer brighter 
areas than PA or FA. Due to the improved contrast, we believe PA is slightly 
preferable in this example. 

6.5. Simulation study. Finally, we consider a simulation study to com- 
pare the different estimators. We consider the problem of estimating a pop- 
ulation covariance matrix £1 from a random sample of k x k covariance ma- 
trices S\,..., S n . 

We consider a random sample generated as follows. Let A = chol(O) and 
Xi be a random matrix with i.i.d. entries with E[(Xi)ji\ = 0, var((JTj),7) = 
a 2 , i = 1, . . . ,n; j = 1, . . . , k; I = 1, . . . , k. We take 

S i = (A + X i )(A + X i f, i = l,...,n. 

We shall consider four error models: 

I. Gaussian square root: (Xi)ji are i.i.d. N(0, a 2 ) for j = 1, . . . , k; I = 1, . . . , k 
II. Gaussian Cholesky: (Xi)ji are i.i.d. N(0,a 2 ) for j < k and zero other- 
wise. 

III. Log-Gaussian: i.i.d. Gaussian errors iV(0,(T 2 ) are added to the matrix 
logarithm of A to give Y, and then the matrix exponential of YY^ is 
taken. 

IV. Student's t with 3 degrees of freedom: (Xi)ji are i.i. d. (ct/\/3>3 for 
j = 1, . . . , k; I = 1, . . . , k. 

We consider the performance in a simulation study, with 1000 Monte 
Carlo simulations. The results are presented in Tables 2 and 3 for two choices 
of population covariance matrix. We took k = 3 and n = 10, 30. In order to 
investigate the efficiency of the estimators, we use three measures: estimated 
mean square error between the estimate and the matrix with metrics 
dE(-),ds(-) and the estimated risk from using Stein loss [James and Stein 
(1961)] which is given by 

L(S±,S2) = trace(S'i5^" 1 ) — logdet(5iS'^~ 1 ) — k, 

where det(-) is the determinant. Clearly the efficiency of the methods de- 
pends strongly on the O and the error distribution. 

Consider the first case where the mean has Ai = 1, A2 = 0.3, A3 = 0.1 in 
Table 2. We discuss model I first where the errors are Gaussian on the 
matrix square root scale. The efficiency is fairly similar for each estimator 
for n = 10, with Y<h performing the best. For n = 30 either or Eg are 
better, with J^e performing least well. For model II with Gaussian errors 
added in the Cholesky decomposition we see that T,q is the best, although 
the other estimators are quite similar, with the exception of which is 
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worse. For model III with Gaussian errors on the matrix logarithm scale all 
estimators are quite similar, as the variability is rather small. The estimate 
T,r is slightly better here than the others. For model IV with Student's £3 
errors we see that E# and Eg are slightly better on the whole, although Y<e 
is again the worst performer. 



Table 2 

Measures of efficiency, with k = 3 and a = 0.1. RMSE is the root mean square error 
using either the Euclidean norm or the Procrustes size- and- shape norm, and "Stein" 
refers to the risk using the Stein loss function. The smallest value in each row is 
highlighted in bold. The mean has parameters Ai = 1, A2 = 0.3, A3 =0.1. The error 
distributions for Models I-IV are Gaussian (square root), Gaussian (Cholesky), 
log-Gaussian and Student's £3, respectively 



S_e Sc Ss Sff S_t £r Sf 



I 



II 



/) 


= 10 


RMSE(d B ) 


0.1136 


0.1057 


0.104 


0.1025 


0.104 


0.1176 


0.1058 






RMSE(ds) 


0.0911 


0.082 


0.0802 


0.0794 


0.0851 


0.0892 


0.0813 






Stein 


0.0869 


0.0639 


0.0615 


0.0604 


0.0793 


0.0728 


0.0626 


n 


= 30 


RMSE(d B ) 


0.0788 


0.0669 


0.0626 


0.0611 


0.0642 


0.0882 


0.0652 






RMSE(ds) 


0.0691 


0.0516 


0.0475 


0.0477 


0.0525 


0.0607 


0.049 






Stein 


0.058 


0.0242 


0.0207 


0.0223 


0.0295 


0.0265 


0.0216 


n 


= 10 


RMSE(d B ) 


0.0973 


0.0889 


0.0911 


0.0906 


0.093 


0.1014 


0.0923 






RMSE(ds) 


0.0797 


0.0695 


0.0714 


0.0713 


0.0752 


0.0785 


0.0721 






Stein 


0.07 


0.0468 


0.0499 


0.0502 


0.0573 


0.0554 


0.0506 


11 


= 30 


RMSE(de) 


0.0641 


0.0513 


0.0535 


0.0533 


0.058 


0.0732 


0.0551 






RMSE(ds) 


0.0585 


0.0399 


0.0422 


0.0432 


0.0471 


0.0533 


0.0431 






Stein 


0.0452 


0.0151 


0.0176 


0.0196 


0.0214 


0.0214 


0.0183 


I 

11 


= 10 


RMSE(dg) 


0.0338 


0.0333 


0.0336 


0.0335 


0.0333 


0.0331 


0.0336 






RMSE(ds) 


0.0195 


0.0193 


0.0194 


0.0194 


0.0192 


0.0191 


0.0194 






Stein 


0.0017 


0.0016 


0.0016 


0.0016 


0.0016 


0.0016 


0.0016 


n 


= 30 


RMSE(d B ) 


0.0329 


0.0324 


0.0327 


0.0327 


0.0324 


0.0322 


0.0328 






RMSE(d s ) 


0.0187 


0.0184 


0.0185 


0.0185 


0.0183 


0.0182 


0.0185 






Stein 


0.0015 


0.0015 


0.0015 


0.0015 


0.0014 


0.0014 


0.0015 


I 

n 


= 10 


RMSE(d B ) 


0.119 


0.1012 


0.1006 


0.0991 


0.0996 


0.109 


0.1049 






RMSE(ds) 


0.1202 


0.082 


0.0818 


0.0811 


0.0822 


0.086 


0.0922 






Stein 


0.1503 


0.064 


0.0637 


0.0639 


0.0676 


0.0636 


0.0639 


11 


= 30 


RMSE(de) 


0.081 


0.0618 


0.0598 


0.0582 


0.0618 


0.0795 


0.0643 






RMSE(ds) 


0.0828 


0.0489 


0.0469 


0.0472 


0.0503 


0.0572 


0.0528 






Stein 


0.0825 


0.0223 


0.021 


0.0228 


0.0251 


0.0235 


0.0217 
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Table 3 

Measures of efficiency, with k = 3 and a — 0.1. RMSE is the root mean square error 
using either the Euclidean norm or the Procrustes size- and- shape norm, and "Stein" 

refers to the risk using the Stein loss function. The smallest value in each row is 
highlighted in bold. The mean has parameters Ai = 1,A2 = 0.001, A3 =0.001. The error 
distributions for Models I-IV are Gaussian (square root), Guassian (Cholesky), 
log-Gaussian and Student's t%, respectively 



S_b Sc Ss £if S_t S_r Sf 



I 



II 



n = 


1 n 


TJTV/TQT^/'J \ 


n nooo 

u.uyyy 


U.zoyo 


U.Uoy4 


U.Uo / O 


n 1 ni a 
U. 1U14 


(1 ciin 
U.Ollz 


U.uyz 






rUVlor^as ) 




n 01 vo 

U.Zl / z 


n 1 AO A 
U. 14Z4 


n 1 /ioi 


U.1U (A 


n QQ/i ^ 
U.oo4o 


C\ -\ A QO 






Stein 


53 4893 


28 1 505 


25.079 


27 7066 


1 2 4056 


15.2749 


25.497 


n = 


oil 




U.U * Uo 


U.zooO 


U.Uooz 


U.Uool 


n nsni 
U.UoUl 


U.oolo 


U.Uoo / 








U.ZU04 




U.loUl 


n 1 000 
U.looo 


U.Uo (V 


U.o4o4 


n 1017 






Stein 


53.3301 


25.8512 


22.2974 


25.378 


8.5161 


12.95 


22.6973 


n = 


10 


RMSE(d B ) 


0.0907 


0.4879 


0.0844 


0.0839 


0.1104 


0.75 


0.0861 






RMSE(ds) 


0.1669 


0.3571 


0.1139 


0.1176 


0.1023 


0.5168 


0.1151 






Stein 


34.2082 


9.8147 


15.4552 


16.4905 


10.2085 


8.6754 


15.7207 


n = 


30 


RMSE(d B ) 


0.0606 


0.5151 


0.0509 


0.0504 


0.0954 


0.7787 


0.0533 






RMSE(ds) 


0.1632 


0.3369 


0.1022 


0.1067 


0.0887 


0.5369 


0.1035 






Stein 


33.9321 


7.6303 


13.4332 


14.63 


7.9578 


7.4431 


13.693 


I 

n = 


10 


RMSE(d B ) 


0.0315 


0.0312 


0.0313 


0.0313 


0.0311 


0.0251 


0.0315 






RMSE(ds) 


0.0162 


0.016 


0.0161 


0.0161 


0.016 


0.013 


0.0162 






Stein 


0.0034 


0.0029 


0.0029 


0.0029 


0.0028 


0.0028 


0.0029 


n — 


30 


RMSE(d B ) 


0.031 


0.0307 


0.0309 


0.0309 


0.0306 


0.0244 


0.031 






RMSE(d s ) 


0.0156 


0.0154 


0.0155 


0.0155 


0.0154 


0.0123 


0.0156 






Stein 


0.0024 


0.0019 


0.0019 


0.0019 


0.0019 


0.0019 


0.0019 


I 

n = 


10 


RMSE(d B ) 


0.1055 


0.2519 


0.0848 


0.0819 


0.0895 


0.5214 


0.0933 






RMSE(d s ) 


0.2187 


0.197 


0.1253 


0.1301 


0.083 


0.3348 


0.1317 






Stein 


56.1488 


19.7674 


18.9143 


20.7028 


6.5634 


7.875 


17.4669 


n = 


30 


RMSE(d B ) 


0.0755 


0.2628 


0.0523 


0.0489 


0.0682 


0.5552 


0.0616 






RMSE(ds) 


0.2098 


0.186 


0.1089 


0.1161 


0.0635 


0.3455 


0.1106 






Stein 


53.9159 


16.9026 


15.701 


17.9492 


4.0551 


6.541 


14.9515 



In Table 3 we now consider the case Ai = 1, A2 = 0.001, A3 = 0.001, where 
£1 is close to being deficient in rank. It is noticeable that the estimators Ec 
and Sfl can behave quite poorly in this example, when using RMSE(cIe) or 
RMSE {ds) for assessment. This is particularly noticeable in the simulations 
for models I, II and IV. The better estimators are generally Eg and El, 
with Tie a little inferior. 
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Overall, in these and other simulations T,h,^s and have performed 
consistently well. 

7. Discussion. In this paper we have introduced new methods and re- 
viewed recent developments for estimating a mean covariance matrix where 
the data are covariance matrices. Such a situation appears to be increasingly 
common in applications. 

Another possible metric is the power Euclidean metric 

(10) d A (S 1 ,S 2 ) = -\\S?-S%\\, 

a 

where S a = UA a U T . We have considered a G {1/2, 1} earlier. As a — > 0, the 
metric approaches the log-Euclidean metric. We could consider any nonzero 
a G R depending on the situation, and the estimate of the covariance matrix 
would be 

C n 1 1 n 

t A = (A A ) 1/a , where A A = arginf £ \\Sf - A\\ 2 = - £ S i- 

Ki=l ) i=l 

For positive a the estimators become more resistant to outliers for smaller 
a, and for larger a the estimators become less resistant to outliers. For 
negative a one is working with powers of the inverse covariance matrix. 
Also, one could include the Procrustes registration if required. The resulting 
fractional anisotropy measure using the power metric (10) is given by 

r , k , k i 1/2 

m«)= ^£(a?-^)7£aH , 

l i=l i=l ) 

and \ a = tYa=i • A practical visualization tool is to vary a in order for 
a neurologist to help interpret the white fiber tracts in the images. 

We have provided some new methods for estimation of covariance matri- 
ces which are themselves rooted in statistical shape analysis. Making this 
connection also means that methodology developed from covariance ma- 
trix analysis could also be useful for applications in shape analysis. There 
is much current interest in high-dimensional covariance matrices [cf. Bickel 
and Levine (2008)], where k^$> n. Sparsity and banding structure often are 
exploited to improve estimation of the covariance matrix or its inverse. Mak- 
ing connections with the large amount of activity in this field should also 
lead to new insights in high-dimensional shape analysis [e.g., see Dryden 
(2005)]. 

Note that the methods of this paper also have potential applications in 
many areas, including modeling longitudinal data. For example, Cholesky 
decompositions are frequently used for modeling longitudinal data, both 
with Bayesian and random effect models [e.g., see Daniels and Kass (2001); 
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Chen and Dunson (2003); Pourahmadi (2007)]. The Procrustes size-and- 
shape metric and matrix square root metric provide a further opportunity 
for modeling, and may have advantages in some applications, for example, in 
cases where the covariance matrices are close to being deficient in rank. Fur- 
ther applications where deficient rank matrices occur are structure tensors in 
computer vision. The Procrustes approach is particularly well suited to such 
deficient rank applications, for example, with structure tensors associated 
with surfaces in an image. Other application areas include the averaging 
of affine transformations [Alexa (2002); Aljabar et al. (2008)] in computer 
graphics and medical imaging. Also the methodology could be useful in com- 
putational Bayesian inference for covariance matrices using Markov chain 
Monte Carlo output. One wishes to estimate the posterior mean and other 
summary statistics from the output, and that the methods of this paper will 
often be more appropriate than the usual Euclidean distance calculations. 
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